1 Introduction

An ongoing topic at all levels of education is the influence of socioeconomic factors on academic performance. This influence begins in infancy, but can extend through all years of life.

In 2021, the publication “Ongelijkheid in het Nederlandse onderwijs door de jaren heen” (English: Inequality in Dutch education over the years) highlighted social inequality in educational achievement. The article makes evident that there is a substantial gap in educational achievement between students from less privileged and more affluent social backgrounds. This gap is reflected in both primary and secondary school academic results (Diris, 2022).

As mentioned above, there is a clear gap in educational achievement between students from different backgrounds in the Netherlands. Several factors may come into play here.

One factor that should not be overlooked is the Dutch school system itself. It consists of 8 years of elementary school, in the last year of which each student receives an advice for the level of secondary school they should attend. This system has drawn criticism, including from Boztas (2023) in an article titled “Education or ‘selecting the new elite’?”. Boztas argues that the advice that determines what level of secondary education a child will follow (VWO, HAVO or VMBO) comes too early in the child’s development.

The Dutch Education Council agrees and recommends delaying the moment of selection until after a broad three-year bridge period, giving students more time to develop (Ministerie van Onderwijs, Cultuur en Wetenschap, 2023). This particularly benefits students from a lower social background who, for example, need more time to learn the language but are no less intelligent than their peers.

In Portugal, a comparable system is in place. Elementary school spans nine years, with the final three years comprising lower secondary education. At the age of 15, students transition to secondary education, where they choose between further academic studies or dual certification, combining academic and vocational tracks (European Commission, 2023).

The question is whether this system can reduce the gap in educational achievement between students from lower and higher social backgrounds, or whether Portugal, like the Netherlands, shows a substantial gap between these groups. Either way, the results of this research are socially relevant.


Hence, this research will center on the following research question: To what extent do socioeconomic factors, including demographic characteristics, educational context, environmental conditions, parental education and occupation, and educational and social support, impact the academic performance of high school students in Portugal?


To load specific packages in R, we use the function  library()  as follows:

library(ggplot2)       # for visualizing our data in R
library(naniar)        # for handling missing values
library(dplyr)         # for data manipulation (e.g. the select() function)
library(reshape2)      # for reshaping data frames
library(psych)         # for descriptive statistics
library(GGally)        # for the ggcorr() function
library(rpart)         # for the CART algorithm
library(rpart.plot)    # for plotting the decision trees
library(C50)           # for the C5.0 algorithm
library(randomForest)  # for the Random Forest algorithm
library(plyr)          # for the mutate() function
library(liver)         # for the find.na() function
library(forcats)       # for the fct_collapse() function
library(Hmisc)         # for replacing missing values

set.seed(1)

2 Education in Portugal

Education in Portugal is mandatory from the age of 6 up to the age of 18. The system is divided into three cycles: the first cycle consists of 4 years, the second spans 2 years and, as mentioned earlier, the last cycle lasts 3 years. The curriculum is set up to ensure that everyone receives a basic education, providing important knowledge and skills that can be built on in further learning (European Commission, 2023). The focus of this research will be on the last cycle.

The research will examine the academic achievements of students in secondary school, the last cycle, who follow the mathematics course. Their grades take a value from 0 to 20, where 10 is the minimum passing grade. We will examine whether there is a relationship between the predictor variables and the outcome variable ‘final grade’.
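As a small illustration of this grading scale (using a hypothetical vector of grades, not the study data), a pass/fail indicator can be derived in R as follows:

```r
# Hypothetical grades on the 0-20 scale (illustration only)
grades <- c(6, 10, 15, 9, 18, 11)

# A grade of 10 or higher counts as a pass
passed <- grades >= 10

mean(passed)  # proportion of students passing
```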


3 Socioeconomic determinants of academic performance

To get a better view of the relationship between socioeconomic status and academic performance in Portugal, this research will explore five socioeconomic determinants of grades.


3.1 Demographic characteristics

Variables addressed in this section: school, sex and age.

It is a worldwide phenomenon that women outperform men in education, not only in primary school, secondary school or college, but also in higher education. The reasons for this can vary greatly by socioeconomic context (Ullah & Ullah, 2019). The question is whether this conclusion can also be drawn in the context of this research, in Portugal.

As mentioned earlier, secondary school in Portugal, as we know it in the Netherlands, only starts at age 15 and lasts three years. The dataset of this study will therefore consist mainly of students aged 15 to 18. Students older than this will in all likelihood have failed a class at least once, which may indicate that this group has more difficulty with the subject and could lead to lower grades than their younger peers.


Hypotheses to be tested:

  • H1: Female students achieve higher grades compared to their male peers.
  • H2: Younger students achieve higher grades compared to their older peers.


3.2 Educational context

There are also a number of personal factors that can contribute to academic performance, such as study time and the ambition to pursue higher education. Students who are eager to attend a university are likely to put considerable effort into their studies, which should be reflected in their weekly study time and, in turn, in better grades. For each of these factors, a positive link to grades can also be inferred independently.

The link runs in the opposite direction for student absences and the number of past class failures. The common intuition is that the more often a student is absent, the more material that student misses, leaving less preparation for the exam and increasing the likelihood of a lower grade. This intuition is confirmed by Leon (2018), who finds that absences negatively affect grades. These lower grades in turn increase the likelihood of failing the class. In addition, failing a class suggests that these individuals may have more difficulty with that particular subject, so the probability of lower grades for these students is also higher.


Hypotheses to be tested:

  • H3: Students who dedicate more time to weekly study achieve higher grades compared to those with fewer weekly study hours.
  • H4: Students with a high number of past class failures attain lower academic performance in comparison to those with fewer past class failures.
  • H5: Students with a high number of school absences attain lower academic performance in comparison to those with fewer school absences.
  • H6: Students with aspirations for higher education achieve superior academic performance compared to those without such aspirations.


3.3 Environmental conditions

The variable address can take two values in this research: rural or urban. Which of the two groups will generally be more intelligent and thus, in all likelihood, perform better in school? A study from America suggests that people with higher intelligence are more likely to live in an urban area than in a rural area (Hayes, 2014). Will the same hold for students in Portugal?

Another influential environmental variable is whether or not someone has access to the Internet. The Internet can be of great use in education: looking up answers to questions is much easier with it. Thus, access to the Internet can influence a student’s results. This is also reflected in the study of Amponsah et al. (2022), where students who had access to the Internet made greater progress in academic performance than those without it.


Hypotheses to be tested:

  • H7: Students with internet access achieve higher grades than those without it.
  • H8: Students who reside in an urban area perform better academically than those residing in a rural area.


3.4 Parental education and occupation

Parents are our first teachers in life, whether in taking care of ourselves, in relationships or in education. It all starts with our parents: they are the first to teach us something in our lives. These first life lessons will not be the same for all families. For example, how someone is raised and what they do or do not inherit from their family can differ greatly.

An important factor in this is the education of the parents. The link between parental education and academic performance has been studied numerous times. Idris et al. (2020) show in their study that there is a significant difference between the results of students with highly educated parents and the average result: the pass rate of students with highly educated parents is 87.67%, which differs significantly from the average pass rate of 72.15%. Their study shows that the lower the education of a student’s parents, the lower the student’s success rate. In other words, the academic performance of students with highly educated parents is generally better. This research aims to test this conclusion on the chosen dataset.

Parental education is not the only thing that can affect academic performance; the occupation of the parents can certainly play a role as well. Today there are countless occupations a parent can hold, so it would be nearly impossible to test the influence of each occupation on children’s academic performance. This research will focus on a select number of occupations, as in the study by Owuor, Simatwa, and Ndolo (2022), which found that parents with occupations requiring higher education influence academic performance more positively than other occupations. This research will examine this relationship between parents with higher-education occupations and students’ academic performance. Thus, the following hypotheses will be tested:


Hypotheses to be tested:

  • H9: Students with parents having higher levels of education achieve higher grades compared to students with parents having lower levels of education.
  • H10: Students with parents having higher-educated occupations achieve higher grades compared to students with parents having lower-educated occupations.


3.5 Educational and social support

School support can come in many forms, as is the case in the dataset of this research, which includes extra educational school support, family educational support and extra paid classes. The motivation behind these forms of extra support can vary widely.

For instance, extra educational school support is usually encouraged from within the school itself. Here the emphasis is placed on students who need this extra help to pass the subject, that is, usually students with lower grades (Sara, 2023).

Family educational support, by contrast, is not so much a response to weaker academic results but often stems from the general encouragement of parents for their child to perform in school. When parents are involved in school, this has positive results not only at school but also for the child’s overall development, according to The Annie E. Casey Foundation (2022). Their study indicates that these students are less likely to suffer from low self-esteem or develop behavioral issues, and are also more likely to achieve higher grades. This research examines the latter: whether students with family educational support generally score higher than students without it.

Finally, there is the extra paid classes variable. The reasons behind taking such classes can vary: the student underperforms (and may need extra paid classes in addition to extra educational school support), or the student aspires to higher grades and admission to one of the better universities. What is certain is that these classes cost money, as the variable indicates, and not everyone has this luxury; usually only those with a higher socioeconomic status do.


Hypotheses to be tested:

  • H11: Students who receive extra educational school support achieve higher grades than those who do not receive such additional educational support.
  • H12: Students who receive family related support achieve higher grades than those who do not receive such support.
  • H13: Students who take extra paid classes achieve higher grades than those who do not take such classes.



4 Data importation, transformation and comprehension

To import the CSV data in R, we use the function read.csv() as follows:

data <- read.csv("~/Desktop/student-mat.csv")

To see the overview of the dataset in R, we use the function str() as follows:

str(data)  # offers a concise overview of the data set's structure
  'data.frame': 395 obs. of  33 variables:
   $ school    : chr  "GP" "GP" "GP" "GP" ...
   $ sex       : chr  "F" "F" "F" "F" ...
   $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
   $ address   : chr  "U" "U" "U" "U" ...
   $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
   $ Pstatus   : chr  "A" "T" "T" "T" ...
   $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
   $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
   $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
   $ Fjob      : chr  "teacher" "other" "other" "services" ...
   $ reason    : chr  "course" "course" "other" "home" ...
   $ guardian  : chr  "mother" "father" "mother" "mother" ...
   $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
   $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
   $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
   $ schoolsup : chr  "yes" "no" "yes" "no" ...
   $ famsup    : chr  "no" "yes" "no" "yes" ...
   $ paid      : chr  "no" "no" "yes" "yes" ...
   $ activities: chr  "no" "no" "no" "yes" ...
   $ nursery   : chr  "yes" "no" "yes" "yes" ...
   $ higher    : chr  "yes" "yes" "yes" "yes" ...
   $ internet  : chr  "no" "yes" "yes" "yes" ...
   $ romantic  : chr  "no" "no" "no" "yes" ...
   $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
   $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
   $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
   $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
   $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
   $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
   $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
   $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
   $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
   $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...


The data set currently comprises 32 predictor variables along with the target variable grade. To keep our focus on the most relevant factors, we will prune some of the insignificant variables in order to enhance the efficiency of our analysis.


4.1 Transformation of data

To emphasize our focus on the socioeconomic factors influencing students’ grades, we will eliminate variables that are not relevant to this investigation. The following variables will be excluded from our data set: famsize, Pstatus, reason, guardian, traveltime, activities, nursery, romantic, famrel, freetime, goout, Dalc, Walc, G1, G2.


To remove certain variables from the data frame in R, we use the function select() as follows:

data = select(data, -c('famsize', 'Pstatus', 'reason', 'guardian', 'traveltime', 'activities', 'nursery', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'G1', 'G2'))

In order to see whether this had an effect, we call the function str():

str(data)  # offers a concise overview of the data set's structure
  'data.frame': 395 obs. of  18 variables:
   $ school   : chr  "GP" "GP" "GP" "GP" ...
   $ sex      : chr  "F" "F" "F" "F" ...
   $ age      : int  18 17 15 15 16 16 16 17 15 15 ...
   $ address  : chr  "U" "U" "U" "U" ...
   $ Medu     : int  4 1 1 4 3 4 2 4 3 3 ...
   $ Fedu     : int  4 1 1 2 3 3 2 4 2 4 ...
   $ Mjob     : chr  "at_home" "at_home" "at_home" "health" ...
   $ Fjob     : chr  "teacher" "other" "other" "services" ...
   $ studytime: int  2 2 2 3 2 2 2 2 2 2 ...
   $ failures : int  0 0 3 0 0 0 0 0 0 0 ...
   $ schoolsup: chr  "yes" "no" "yes" "no" ...
   $ famsup   : chr  "no" "yes" "no" "yes" ...
   $ paid     : chr  "no" "no" "yes" "yes" ...
   $ higher   : chr  "yes" "yes" "yes" "yes" ...
   $ internet : chr  "no" "yes" "yes" "yes" ...
   $ health   : int  3 3 3 5 5 5 3 1 1 5 ...
   $ absences : int  6 4 10 2 4 10 0 6 0 0 ...
   $ G3       : int  6 6 10 15 10 15 11 6 19 15 ...


The insignificant variables have been removed from our data set. However, a large proportion of our variables were misinterpreted during the import of the data frame. Specifically, variables that are intended to be categorical have been imported as character data types. Accurate data type assignment is crucial, since certain operations are valid only for specific data types.


To convert variable data types from character to factor in the data frame in R, we use the function as.factor() as follows:

data$school <- as.factor(data$school)
data$sex <- as.factor(data$sex)
data$address <- as.factor(data$address)
data$Medu <- as.factor(data$Medu)
data$Fedu <- as.factor(data$Fedu)
data$Mjob <- as.factor(data$Mjob)
data$Fjob <- as.factor(data$Fjob)
data$studytime <- as.factor(data$studytime)
data$schoolsup <- as.factor(data$schoolsup)
data$famsup <- as.factor(data$famsup)
data$paid <- as.factor(data$paid)
data$higher <- as.factor(data$higher)
data$internet <- as.factor(data$internet)
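The thirteen conversions above can also be written more compactly; the following equivalent sketch (shown on a small stand-in data frame, not the actual dataset) applies as.factor() to a vector of column names in one pass:

```r
# Small stand-in data frame with a few of the student columns
data <- data.frame(sex = c("F", "M"), higher = c("yes", "no"),
                   age = c(16L, 17L), stringsAsFactors = FALSE)

# Convert every listed column to a factor in one step
factor_cols <- c("sex", "higher")
data[factor_cols] <- lapply(data[factor_cols], as.factor)

str(data)  # sex and higher are now factors; age is untouched
```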


We will also rename some of our variables to make the analysis easier to perform. To rename variables in the data frame in R, we use the function names() as follows:

names(data)[5] <- 'medu'
names(data)[6] <- 'fedu' 
names(data)[7] <- 'mjob' 
names(data)[8] <- 'fjob' 
names(data)[18] <- 'grade' 
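Renaming by position (names(data)[5], names(data)[18]) silently breaks if the column order ever changes. A more robust sketch, shown on a small stand-in frame, matches on the old names instead:

```r
# Small stand-in frame with the original column names
data <- data.frame(Medu = 4L, Fedu = 2L, G3 = 15L)

# Rename by name rather than by position
names(data)[names(data) == "Medu"] <- "medu"
names(data)[names(data) == "Fedu"] <- "fedu"
names(data)[names(data) == "G3"]   <- "grade"

names(data)  # "medu" "fedu" "grade"
```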


Again, in order to see whether this had an effect, we call the function str():

str(data)  # offers a concise overview of the data set's structure
  'data.frame': 395 obs. of  18 variables:
   $ school   : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
   $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
   $ age      : int  18 17 15 15 16 16 16 17 15 15 ...
   $ address  : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
   $ medu     : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
   $ fedu     : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
   $ mjob     : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
   $ fjob     : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
   $ studytime: Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
   $ failures : int  0 0 3 0 0 0 0 0 0 0 ...
   $ schoolsup: Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
   $ famsup   : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
   $ paid     : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
   $ higher   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
   $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
   $ health   : int  3 3 3 5 5 5 3 1 1 5 ...
   $ absences : int  6 4 10 2 4 10 0 6 0 0 ...
   $ grade    : int  6 6 10 15 10 15 11 6 19 15 ...


We form a final overview of the variables that we want to use to tackle our research question. To create an overview in R, we use the  summary()  function as follows:

summary(data)  # provides an aggregated view of each variable 
   school   sex          age       address medu    fedu          mjob    
   GP:349   F:208   Min.   :15.0   R: 88   0:  3   0:  2   at_home : 59  
   MS: 46   M:187   1st Qu.:16.0   U:307   1: 59   1: 82   health  : 34  
                    Median :17.0           2:103   2:115   other   :141  
                    Mean   :16.7           3: 99   3:100   services:103  
                    3rd Qu.:18.0           4:131   4: 96   teacher : 58  
                    Max.   :22.0                                         
         fjob     studytime    failures      schoolsup famsup     paid    
   at_home : 20   1:105     Min.   :0.0000   no :344   no :153   no :214  
   health  : 18   2:198     1st Qu.:0.0000   yes: 51   yes:242   yes:181  
   other   :217   3: 65     Median :0.0000                                
   services:111   4: 27     Mean   :0.3342                                
   teacher : 29             3rd Qu.:0.0000                                
                            Max.   :3.0000                                
   higher    internet      health         absences          grade      
   no : 20   no : 66   Min.   :1.000   Min.   : 0.000   Min.   : 0.00  
   yes:375   yes:329   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00  
                       Median :4.000   Median : 4.000   Median :11.00  
                       Mean   :3.554   Mean   : 5.709   Mean   :10.42  
                       3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:14.00  
                       Max.   :5.000   Max.   :75.000   Max.   :20.00


4.2 Missing values

In this section, we assess the presence of any missing values. To check for missing values (NA’s) in R, we use the find.na() function as follows:

find.na(data)
  [1] " No missing values (NA) in the dataset."


This message indicates that this data frame does not have any missing values (NA’s).


We can also check for missing values (NA’s) in R by plotting them with the gg_miss_var() function:

gg_miss_var(data)


Based on the displayed plot above, we can again infer that our data frame is devoid of any NA’s, meaning that there are no missing values in our data set.
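For completeness, missing values can also be confirmed with base R, without any extra packages; a minimal sketch on a toy data frame containing one NA:

```r
# Toy data frame with a single missing value for illustration
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

anyNA(df)           # TRUE when any value is missing
colSums(is.na(df))  # count of NA's per column
```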


4.3 Outliers

Outliers or unusual values are observations that deviate significantly from the overall pattern of the dataset. When sifting through a large volume of data, outliers can often become obscured in visual representations like histograms or two-dimensional scatter plots.


After reviewing the data summary, we have reason to believe that the numerical variables grade and absences, and the categorical variables medu and fedu may potentially include outliers.


Numerical variables grade and absences

To investigate this further, let’s examine the histograms and the two-dimensional scatter plot for the variables grade and absences.

ggplot( data = data, aes( x = grade ) ) +
     geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec") + 
     labs( title = "Histogram target variable 'grade'" )  + 
     theme(plot.title = element_text(hjust = 0.5))

ggplot( data = data, aes( x = absences ) ) +
     geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec") + 
     labs( title = "Histogram variable 'absences'" )  + 
     theme(plot.title = element_text(hjust = 0.5))

gg <- ggplot( data = data, mapping = aes( x = grade, y = absences ) ) + 
    geom_point( colour = 'blue' ) + 
    labs( title = "Two-dimensional scatter plot 'grade' versus 'absences'" )  + 
    theme(plot.title = element_text(hjust = 0.5))

gg <- gg + geom_point(data = data.frame(x = c(0, 8, 9, 11), y = c(0, 56, 75, 54 )), aes(x, y), color = 'red', size = 10, shape = 1)

print(gg) 


Based on the above plots, it seems reasonable to infer that the target variable grade contains outliers. The accumulation of students with a grade of 0 may indicate that these students were either absent during the examination or failed to meet the minimum requirements. Therefore, we will classify the obtained grades of 0 as outliers. Furthermore, there also appear to be outliers in the variable absences. These outliers are characterized by an unusually high number of absences, which deviates from the typical distribution of the variable. Therefore, we will classify any number of absences exceeding 40 as an outlier.


To address the unusual values in the grade and absences variables, we will begin by substituting these outliers with missing values. To achieve this, we call the  mutate()  function to replace the variables with a modified copy. The  ifelse()  function is utilized to replace the unusual values with NA (missing values).

data = mutate( data, grade = ifelse( grade == 0, NA, grade ) ) 

data = mutate( data, absences = ifelse( absences > 40, NA, absences ) ) 

Instead of removing the outliers, we have opted for an approach called imputation. In this case, we will impute the missing values by assigning random values in proportion to the records within each category. To accomplish this, we use the  impute()  function as follows:

data$grade = impute( data$grade, 'random' ) 

data$absences = impute( data$absences, 'random' )
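Hmisc’s impute(x, 'random') fills each NA with a value drawn at random from the observed values. The mechanism can be sketched in base R as follows (an illustration of the idea, not the Hmisc implementation itself):

```r
set.seed(1)

# Vector with two missing values
x <- c(5, NA, 8, 12, NA, 10)

# Replace each NA with a random draw from the observed values,
# so the imputed values follow the observed distribution
observed <- x[!is.na(x)]
x[is.na(x)] <- sample(observed, sum(is.na(x)), replace = TRUE)

x  # no missing values remain
```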

To assess whether the mutation and imputation process had any impact, we will analyze the histograms, two-dimensional scatter plot and the imputed summary of the variables grade and absences after these procedures have been applied:

ggplot( data = data, aes( x = grade ) ) +
     geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec" ) + 
     labs( title = "Histogram target variable 'grade'" )  + 
     theme(plot.title = element_text(hjust = 0.5))

ggplot( data = data, aes( x = absences ) ) +
     geom_histogram( bins = 30, color = "#283a53", fill = "#d5e0ec" ) + 
     labs( title = "Histogram variable 'absences'" )  + 
     theme(plot.title = element_text(hjust = 0.5))

ggplot( data = data, mapping = aes( x = grade, y = absences ) ) + 
    geom_point( colour = 'blue' ) + 
    labs( title = "Two-dimensional scatter plot 'grade' versus 'absences'" )  + 
    theme(plot.title = element_text(hjust = 0.5))

summary(data$grade)
  
  Imputed Values:
  
  data$grade 
         n  missing distinct     Info     Mean      Gmd      .05      .10 
        38        0       12    0.979    13.05    3.596     9.85    10.00 
       .25      .50      .75      .90      .95 
     10.25    12.00    15.75    18.00    18.15 
                                                                              
  Value       8.00  8.99  9.98 10.97 11.96 12.95 13.94 14.93 15.92 16.91 17.90
  Frequency      1     1     8     8     3     1     2     4     3     2     3
  Proportion 0.026 0.026 0.211 0.211 0.079 0.026 0.053 0.105 0.079 0.053 0.079
                  
  Value      19.00
  Frequency      2
  Proportion 0.053
  
  For the frequency table, variable is rounded to the nearest 0.11
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     4.00   10.00   11.00   11.67   14.00   20.00
summary(data$absences)
  
  Imputed Values:
  
  [1] 3 1 0
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.000   0.000   4.000   5.251   8.000  40.000


The outliers for the numerical variables grade and absences appear to have been effectively imputed.


Categorical variables medu and fedu

We should also have a look at the categorical variables medu and fedu. Note that the concept of an outlier does not strictly apply to categorical data; however, categories with significantly lower frequencies than the others can be treated as outliers within categorical data. To explore this further, let’s examine the frequencies by plotting both bar plots.

ggplot( data = data ) + 
    geom_bar( aes( x = medu ), color = "#283a53", fill = "#b6d2e4" ) +
    labs( title = "Bar plot for the variable 'medu'" ) + 
    theme(plot.title = element_text(hjust = 0.5)) 

summary(data$medu)
    0   1   2   3   4 
    3  59 103  99 131


ggplot( data = data ) + 
    geom_bar( aes( x = fedu ), color = "#283a53", fill = "#b6d2e4" ) +
    labs( title = "Bar plot for the variable 'fedu'" ) + 
    theme(plot.title = element_text(hjust = 0.5)) 

summary(data$fedu) 
    0   1   2   3   4 
    2  82 115 100  96


Based on the above bar plots, it seems reasonable to infer that the variables medu and fedu contain outliers. The category 0, representing parents with no formal education, contains significantly fewer observations compared to the other categories. Hence, we will treat parents with no formal education as outliers.


To address the unusual values in the medu and fedu variables, we will start by replacing these outliers with missing values (NA’s). Subsequently, rather than removing these outliers, we have chosen to impute them with the mode of each variable. To accomplish this, we assign NA to the affected records and then use the impute() function as follows:

data$medu[data$medu == 0] <- NA
data$medu = impute( data$medu, 4 )  # 4 (higher education) is the mode of medu
data$medu <- droplevels(data$medu)


data$fedu[data$fedu == 0] <- NA
data$fedu = impute( data$fedu, 2 )  # 2 (5th to 9th grade) is the mode of fedu
data$fedu <- droplevels(data$fedu)
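An alternative to the NA-and-impute approach would use fct_collapse() from forcats, which was loaded earlier for exactly this kind of operation. The sketch below merges the rare level 0 into the adjacent level 1 instead of the mode; this is a different modeling choice, shown here only as an option:

```r
library(forcats)

# Stand-in education factor with a rare "0" (no formal education) level
edu <- factor(c("0", "1", "2", "4", "4"))

# Collapse the rare level "0" into level "1" (primary education)
edu <- fct_collapse(edu, "1" = c("0", "1"))

levels(edu)  # "1" "2" "4"
```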


To assess whether the mutation and imputation process had any impact, we will analyze the histograms and the imputed summary of the variables medu and fedu after these procedures have been applied:

ggplot( data = data ) + 
    geom_bar( aes( x = medu ), color = "#283a53", fill = "#b6d2e4" ) +
    labs( title = "Bar plot for the variable 'medu'" ) + 
    theme(plot.title = element_text(hjust = 0.5)) 

summary(data$medu) 
    1   2   3   4 
   59 103  99 134
ggplot( data = data ) + 
    geom_bar( aes( x = fedu ), color = "#283a53", fill = "#b6d2e4" ) +
    labs( title = "Bar plot for the variable 'fedu'" ) + 
    theme(plot.title = element_text(hjust = 0.5)) 

summary(data$fedu) 
    1   2   3   4 
   82 117 100  96


The outliers for the categorical variables medu and fedu appear to have been effectively imputed.



4.4 Comprehension

The data was gathered during the 2005-2006 school year from two public schools (Gabriel Pereira and Mousinho da Silveira), both situated in the Alentejo region of Portugal. The data was collected as part of a comprehensive survey conducted among students enrolled in mathematics courses at the secondary school level. Our modified dataset contains 395 observations of 17 predictors along with the target variable grade. The variables are described as follows:


  • school: students’ school (Gabriel Pereira or Mousinho da Silveira)

  • sex: students’ sex (female or male)

  • age: students’ age in years (ranging from 15 to 22)

  • address: students’ home address type (urban or rural)

  • medu: mother’s education (scale 0 to 4; 0: none, 1: primary education, 2: 5th to 9th grade, 3: secondary education, 4: higher education)

  • fedu: father’s education (scale 0 to 4; 0: none, 1: primary education, 2: 5th to 9th grade, 3: secondary education, 4: higher education)

  • mjob: mother’s job (teacher, health care related, civil services, at home or other)

  • fjob: father’s job (teacher, health care related, civil services, at home or other)

  • studytime: weekly study time (scale 1 to 4; 1: < 2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, 4: > 10 hours)

  • failures: number of past class failures (ranging from 0 to 3)

  • schoolsup: extra educational school support (yes or no)

  • famsup: family educational support (yes or no)

  • paid: extra paid classes (yes or no)

  • higher: wants to take higher education (yes or no)

  • internet: internet access at home (yes or no)

  • absences: number of school absences (ranging from 0 to 40)

  • grade: final grade (ranging from 0 to 20)


The variable types can be categorized as follows:


Variable    R datatype   Statistical datatype
school      factor       categorical - binary
sex         factor       categorical - binary
age         int          numerical - discrete
address     factor       categorical - binary
medu        int          numerical - discrete
fedu        int          numerical - discrete
mjob        factor       categorical - nominal
fjob        factor       categorical - nominal
studytime   int          numerical - discrete
failures    int          numerical - discrete
schoolsup   factor       categorical - binary
famsup      factor       categorical - binary
paid        factor       categorical - binary
higher      factor       categorical - binary
internet    factor       categorical - binary
absences    int          numerical - discrete
grade       int          numerical - discrete


5 Exploratory Data Analysis (EDA)

In order to further our understanding about grade and the variables affecting it, we will use graphs, plots, and tables to uncover relationships that are relevant for further investigation. We first start by looking at the general distribution of the grades obtained by high school students in Portugal.


5.1 Investigate the target variable grade

In Portugal, high schools and higher education institutions use a 20-point grading scale, with 10 as the minimum passing grade. This means that students need to score a 10 or higher to pass the Mathematics course. First, we generate a simple overview of the range of the grades obtained by the students.

grade_summary <- capture.output(summary(data$grade))
cleaned_summary <- tail(grade_summary, 2)
cat(cleaned_summary, sep = "\n")
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     4.00   10.00   11.00   11.67   14.00   20.00


We will create a density plot to offer a more in-depth view of the distribution of final grades. Density plots are a valuable tool in data visualization for displaying the distribution of a continuous variable.

ggplot(data = data) + 
  geom_density(aes(x = grade), color = "#283a53", fill = "#d5e0ec", alpha = 0.8) +
  ggtitle("General distribution of the final grades") +
  xlab("Final grades") + ylab("Density") + 
  theme(plot.title = element_text(hjust = 0.5))


The density plot of the target variable grade shown above exhibits a distribution that closely resembles a normal distribution.


In this part, the grades obtained by the students will be categorized into a binary classification. This categorization allows us to create plots where the target variable is binary. We will categorize the grades into the binary classification by calling the function  cut()  as follows:

cutgrade2 <- cut(data $ grade, breaks=c(0, 9, 20), labels = c('Fail', 'Pass'), include.lowest = TRUE) 


Binary classification: a student passed the course if the final grade is greater than or equal to 10; otherwise they failed.

  • Fail: grades 0 to 9

  • Pass: grades 10 to 20

ggplot( data = data ) + 
    geom_bar( aes( x = cutgrade2 ), color = "#283a53", fill = "#b6d2e4" ) +
    ggtitle("Bar plot binary grade classification") + 
    theme(plot.title = element_text(hjust = 0.5)) + 
    xlab("classification") + ylab("count") 


Out of the total number of students, 76.2% (301 students) successfully passed the Mathematics course, while 23.8% (94 students) did not meet the passing requirements for the course.
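These percentages follow directly from the counts in the binary classification; a minimal self-contained sketch using the reported counts (on the actual data, `table(cutgrade2)` yields these counts):

```r
# Reproduce the reported pass/fail percentages from the counts
counts <- c(Fail = 94, Pass = 301)
round(100 * prop.table(counts), 1)
#  Fail  Pass 
#  23.8  76.2
```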


5.2 Determinant of grade 1: demographic factors

Variable school

In this section we investigate the relationship between the variable school and the target variable grade.

summary(data $ school)
   GP  MS 
  349  46


Out of the 395 students, 88.4% (349 students) went to Gabriel Pereira and 11.6% (46 students) went to Mousinho da Silveira.

ggplot( data = data ) +
  geom_boxplot( aes( x = school, y = grade ), fill = c("#b6d2e4", "#f8a5a7") ) +
  scale_x_discrete(labels= c('Gabriel Pereira', 'Mousinho da Silveira')) + 
  xlab("Students' school") + ylab("Final grades") + 
  ggtitle("Boxplot variable school") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data, aes( x = school, fill = cutgrade2 ) ) +
  geom_bar( position = "fill", color = "#283a53") +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  ggtitle("Proportional bar plot variable school") 


The above plots do not provide compelling graphical evidence of the predictive importance of the variable school. It appears that the grade distributions at the two schools are approximately equal. To investigate this further, we will conduct a hypothesis test to assess the potential dependency between the variables school and grade.

t.test(grade ~ school, data = data)
  
    Welch Two Sample t-test
  
  data:  grade by school
  t = 1.8636, df = 60.222, p-value = 0.06726
  alternative hypothesis: true difference in means between group GP and group MS is not equal to 0
  95 percent confidence interval:
   -0.06464906  1.82931831
  sample estimates:
  mean in group GP mean in group MS 
          11.77364         10.89130


We do not reject the null-hypothesis since the p-value > 𝛼 = 0.05. Thus, at a significance level of 5%, we have found insufficient evidence to infer that the mean final grades of Gabriel Pereira (GP) and Mousinho da Silveira (MS) differ. Based on the two-sample t-test, there does not seem to be a significant relationship between the variable school and the target variable grade.
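The Welch statistic reported above can also be computed by hand from the per-group means and variances, t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂); a self-contained check on toy vectors (not the report's data, where the groups would be data$grade split by data$school):

```r
# Welch's t from per-group summaries, checked against t.test()
g1 <- c(12, 11, 14, 10, 13)
g2 <- c(10, 12, 9, 11)
t_manual  <- (mean(g1) - mean(g2)) /
             sqrt(var(g1) / length(g1) + var(g2) / length(g2))
t_builtin <- unname(t.test(g1, g2)$statistic)  # Welch by default
stopifnot(all.equal(t_manual, t_builtin))
```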


Variable sex

In this section we investigate the relationship between the variable sex and the target variable grade.

summary(data$sex) 
    F   M 
  208 187


Out of the 395 students, 52.7% (208 students) are female and 47.3% (187 students) are male.

ggplot( data = data ) +
  geom_boxplot( aes( x = sex, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('Female', 'Male')) + 
  xlab("Students' sex") + ylab("Final grades") + 
  ggtitle("Boxplot variable sex") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = sex ), alpha = 0.3 ) + 
  ggtitle("Density plot variable sex") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots do not provide compelling graphical evidence of the predictive importance of the variable sex. It appears that the grade distributions between females and males are approximately equal. To investigate this further, we will conduct a hypothesis test to assess the potential dependency between the variables sex and grade.

t.test(grade ~ sex, data = data)
  
    Welch Two Sample t-test
  
  data:  grade by sex
  t = -1.1663, df = 390.46, p-value = 0.2442
  alternative hypothesis: true difference in means between group F and group M is not equal to 0
  95 percent confidence interval:
   -1.0239690  0.2614228
  sample estimates:
  mean in group F mean in group M 
         11.49038        11.87166


We do not reject the null-hypothesis since the p-value > 𝛼 = 0.05. Thus, at a significance level of 5%, we have found insufficient evidence to infer that the mean final grades for male and female students are different. Based on the two-sample t-test, there does not seem to be a significant relationship between the variable sex and the target variable grade. Hence, H1 will be rejected.

H1: Male students perform academically better than their female peers. This hypothesis has been rejected!


Variable age*

In this section we investigate the relationship between the variable age and the target variable grade.

summary(data $ age)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     15.0    16.0    17.0    16.7    18.0    22.0
ggplot( data = data ) + 
  geom_bar( aes( x = age, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  ggtitle("Histogram variable age") 

ggplot( data = data, aes( x = age, fill = cutgrade2 ) ) +
  geom_bar( position = "fill", color = "#283a53") +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  ggtitle("Proportional histogram variable age")  

ggplot(data=data, aes(x=age, y=grade))+
  geom_point(color="#87adc4") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  ggtitle("Scatterplot grade versus age") 


The above plots indicate graphical evidence of the predictive importance of the variable age. It seems to be the case that older students fail classes more often than their younger classmates. To investigate this further, we will conduct a hypothesis test to assess the potential dependency between the variables age and grade.

cor.test( x = data $ grade, 
          y = data $ age, 
          alternative = "less", 
          conf.level = 0.95 ) 
  
    Pearson's product-moment correlation
  
  data:  data$grade and data$age
  t = -2.6252, df = 393, p-value = 0.004499
  alternative hypothesis: true correlation is less than 0
  95 percent confidence interval:
   -1.00000000 -0.04892303
  sample estimates:
         cor 
  -0.1312777


We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that there exists a significant negative relationship between the age of students and their academic grades. Hence, H2 will be accepted.
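As a side note, the t-statistic that cor.test() reports follows directly from the sample correlation via t = r√(n − 2) / √(1 − r²); a small self-contained check on simulated data (not the report's data):

```r
# Relation between Pearson's r and the cor.test() t-statistic
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
ct <- cor.test(x, y)
r  <- unname(ct$estimate)
t_from_r <- r * sqrt(50 - 2) / sqrt(1 - r^2)
stopifnot(all.equal(unname(ct$statistic), t_from_r))
```

With the values reported above, t = −0.1313 · √393 / √(1 − 0.1313²) ≈ −2.63, matching the output.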


5.3 Determinant of grade 2: educational factors

Variable studytime*

In this section we investigate the relationship between the variable studytime and the target variable grade. Studytime is a categorical variable representing the weekly study time (scale 1 to 4).

summary (data $ studytime)
    1   2   3   4 
  105 198  65  27


The scales represent the following:

  • 1: represents students that studied 2 hours or less a week

  • 2: represents students that studied between 2 to 5 hours a week

  • 3: represents students that studied between 5 to 10 hours a week

  • 4: represents students that studied more than 10 hours a week

ggplot( data = data ) + 
  geom_bar( aes( x = studytime, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable studytime") + 
  theme(plot.title = element_text(hjust = 0.5))  

ggplot( data = data, aes( x = studytime, fill = cutgrade2 ) ) +
  geom_bar( position = "fill", color = "#283a53") +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Proportional histogram variable studytime") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable studytime. It appears to be the case that students who dedicate more time to their studies tend to achieve higher grades compared to those who study for shorter durations. To investigate this further, we will conduct a hypothesis test to assess the potential dependency between the variables studytime and grade.


anova = aov( grade ~ studytime, data = data )  # note: aov() has no 'alternative' argument; the F-test rejects for large F
summary( anova )
               Df Sum Sq Mean Sq F value Pr(>F)   
  studytime     3    134    44.6    4.33 0.0051 **
  Residuals   391   4027    10.3                  
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Therefore, at a significance level of 5%, we have found sufficient evidence to infer that the mean grade differs across studytime groups; combined with the plots above, this suggests a significant positive relationship between the variable studytime and the target variable grade. Hence, H3 will be accepted.
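An ANOVA only tells us that some group means differ, not which pairs; Tukey's HSD (base R's TukeyHSD()) provides the pairwise comparisons. On the report's data this would be TukeyHSD(aov(grade ~ studytime, data = data)), assuming studytime is stored as a factor (which the 3 degrees of freedom in the table imply); a self-contained sketch on toy data:

```r
# Pairwise follow-up to a significant ANOVA with Tukey's HSD (toy data)
set.seed(42)
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 20)),
                  score = rnorm(60) + rep(c(0, 0, 2), each = 20))
fit <- aov(score ~ group, data = toy)
TukeyHSD(fit)  # adjusted p-values for each pair of groups
```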


Variable failures*

In this section we investigate the relationship between the variable failures and the target variable grade.

table(data $ failures)
  
    0   1   2   3 
  312  50  17  16
ggplot( data = data ) + 
  geom_bar( aes( x = failures, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable failures") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data, aes( x = failures, fill = cutgrade2 ) ) +
  geom_bar( position = "fill", color = "#283a53") +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Proportional histogram variable failures") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot(data=data, aes(x=failures, y=grade))+
  geom_point(color="#87adc4") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() + 
  ggtitle("Scatterplot grade versus failures") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable failures. It appears to be the case that students who have failed classes in the past tend to achieve lower grades than those who have not. To investigate this further, we will conduct a correlation test to assess the potential dependency between the variables failures and grade.

cor.test( x = data $ grade, 
          y = data $ failures, 
          alternative = "less", 
          conf.level = 0.95 )
  
    Pearson's product-moment correlation
  
  data:  data$grade and data$failures
  t = -4.6299, df = 393, p-value = 2.489e-06
  alternative hypothesis: true correlation is less than 0
  95 percent confidence interval:
   -1.0000000 -0.1473185
  sample estimates:
         cor 
  -0.2274285


We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that there exists an inverse relationship between the variables failures and grade. In other words, an increase in the number of past class failures corresponds to lower academic performance. Hence, H4 will be accepted.


Variable absences*

In this section we investigate the relationship between the variable absences and the target variable grade.

absences_summary <- capture.output(summary(data$absences))
cleaned_summary <- tail(absences_summary, 2)
cat(cleaned_summary, sep = "\n")
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.000   0.000   4.000   5.251   8.000  40.000
ggplot( data = data ) +
  geom_density( aes( x = absences, fill = cutgrade2 ), alpha = 0.3 ) + 
  ggtitle("Density plot variable absences") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot(data=data, aes(x=absences, y=grade))+
  geom_point(color="#87adc4") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() + 
  ggtitle("Scatterplot grade versus absences") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable absences. It seems to be the case that students who were absent more than 10 times score considerably lower grades than those who were absent fewer than 10 times. To investigate this further, we will conduct a hypothesis test to assess the potential dependency between the variables absences and grade.

cor.test( x = data $ grade, 
          y = data $ absences, 
          alternative = "less", 
          conf.level = 0.95 )
  
    Pearson's product-moment correlation
  
  data:  data$grade and data$absences
  t = -4.8196, df = 393, p-value = 1.029e-06
  alternative hypothesis: true correlation is less than 0
  95 percent confidence interval:
   -1.00000 -0.15641
  sample estimates:
         cor 
  -0.2362345


We reject the null-hypothesis since the p-value<𝛼=0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that there exists an inverse relationship between the variables absences and grade. Hence, H5 will be accepted.


Variable higher*

In this section we investigate the relationship between the variable higher and the target variable grade.

ggplot( data = data ) + 
  geom_bar( aes( x = higher, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable higher") + 
  theme(plot.title = element_text(hjust = 0.5)) 


Out of the 395 students, 94.9% (375 students) of the students want to pursue higher education and 5.1% (20 students) are not planning to pursue higher education.

ggplot( data = data ) +
  geom_boxplot( aes( x = higher, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('no', 'yes')) + 
  xlab("pursue higher education") + ylab("final grades") + 
  ggtitle("Boxplot variable higher") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = higher ), alpha = 0.3 ) + 
  ggtitle("Density plot variable higher") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable higher. It appears that students who aspire to pursue higher education tend to achieve higher grades compared to those who do not. To investigate this further, we will conduct a hypothesis test to assess the potential dependency between the variables higher and grade.

t.test(grade ~ higher, data = data, alternative = 'less')
  
    Welch Two Sample t-test
  
  data:  grade by higher
  t = -4.0384, df = 26.974, p-value = 0.0002002
  alternative hypothesis: true difference in means between group no and group yes is less than 0
  95 percent confidence interval:
         -Inf -0.9872031
  sample estimates:
   mean in group no mean in group yes 
           10.05000          11.75733


We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that the mean final grade of students with aspirations for higher education is higher than that of students without such aspirations. Based on the two-sample t-test, there seems to be a significant positive relationship between the variable higher and the target variable grade. Hence, H6 will be accepted.


5.4 Determinant of grade 3: environmental factors

Variable internet*

In this section we investigate the relationship between the variable internet and the target variable grade.

ggplot( data = data ) + 
  geom_bar( aes( x = internet, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable internet") + 
  theme(plot.title = element_text(hjust = 0.5)) 


Out of the 395 students, 83.3% (329 students) have access to internet and 16.7% (66 students) do not have access to internet.

ggplot( data = data ) +
  geom_boxplot( aes( x = internet, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('no access', 'access')) + 
  xlab("access to internet") + ylab("final grades") + 
  ggtitle("Boxplot variable internet") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = internet ), alpha = 0.3 ) + 
  ggtitle("Density plot variable internet") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate slight graphical evidence of the predictive importance of the variable internet. It appears to be the case that students who do not have access to the internet score lower grades than students who do. To investigate this further, we will conduct a two-sample t-test to assess the potential dependency between the variables internet and grade.

t.test(grade ~ internet, data = data, alternative = 'less')
  
    Welch Two Sample t-test
  
  data:  grade by internet
  t = -1.9169, df = 96.666, p-value = 0.0291
  alternative hypothesis: true difference in means between group no and group yes is less than 0
  95 percent confidence interval:
        -Inf -0.107632
  sample estimates:
   mean in group no mean in group yes 
           11.00000          11.80547


We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that students with access to the internet tend to achieve higher grades than those without. Based on the two-sample t-test, there seems to be a significant positive relationship between the variable internet and the target variable grade. Hence, H7 will be accepted. Note, however, that the corresponding two-sided test for the variable internet is insignificant at a 5% significance level.
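The closing remark follows from the mechanics of one-sided tests: when the observed difference lies in the direction of the alternative, the two-sided p-value is exactly twice the one-sided one (here 2 × 0.0291 ≈ 0.058 > 0.05). A self-contained check on toy vectors:

```r
# Two-sided p = 2 * one-sided p when the effect is in the tested direction
a <- c(1, 2, 3, 4, 5)   # lower-mean group
b <- c(3, 4, 5, 6, 7)   # higher-mean group
p_one <- t.test(a, b, alternative = "less")$p.value
p_two <- t.test(a, b)$p.value
stopifnot(all.equal(p_two, 2 * p_one))
```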


Variable address*

In this section we investigate the relationship between the variable address and the target variable grade.

ggplot( data = data ) + 
  geom_bar( aes( x = address, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable address") + 
  theme(plot.title = element_text(hjust = 0.5)) 


Out of the 395 students, 77.7% (307 students) of the students live in urban areas and 22.3% (88 students) live in rural areas.

ggplot( data = data ) +
  geom_boxplot( aes( x = address, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('Rural (R)', 'Urban (U)')) + 
  xlab("Address type") + ylab("final grades") + 
  ggtitle("Boxplot variable address") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = address ), alpha = 0.3 ) + 
  ggtitle("Density plot variable address") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable address. It appears to be the case that students who live in urban areas score higher grades than students who live in rural areas. To investigate this further, we will conduct a two-sample t-test to assess the potential dependency between the variables address and grade.

t.test(grade ~ address, data = data, alternative = 'less')
  
    Welch Two Sample t-test
  
  data:  grade by address
  t = -2.4924, df = 145.67, p-value = 0.006904
  alternative hypothesis: true difference in means between group R and group U is less than 0
  95 percent confidence interval:
         -Inf -0.3193428
  sample estimates:
  mean in group R mean in group U 
         10.93182        11.88274


We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Thus, at a significance level of 5%, we have found sufficient evidence to infer that students residing in an urban area perform academically better than those residing in a rural area. Based on the two-sample t-test, there seems to be a significant relationship between the variable address and the target variable grade. Hence, H8 will be accepted.


5.5 Determinant of grade 4: parental education and occupation

Variable medu*

In this section we investigate the relationship between the variable medu and the target variable grade. Medu is a categorical variable representing the highest attained education level of the student’s mother (scale 1 to 4).

summary(data $ medu)
    1   2   3   4 
   59 103  99 134


The scales represent the following:

  • 1: finished primary education

  • 2: finished 5th to 9th grade of primary education

  • 3: finished secondary education

  • 4: finished higher education

ggplot( data = data, aes( x = medu, y = grade )) + 
  geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28') ) +
  labs( x = "mothers' education", y = "final grade" ) + 
  ggtitle("Boxplot variable medu") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = medu ), alpha = 0.3 ) + 
  scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28')) + 
  ggtitle("Density plot variable medu") + 
  theme(plot.title = element_text(hjust = 0.5)) 

The above plots indicate graphical evidence of the predictive importance of the variable medu. It appears to be the case that students whose mothers attained a higher education level score higher grades than students whose mothers attained a lower education level. To investigate this further, we will conduct an ANOVA test to assess the potential dependency between the variables medu and grade.

anova = aov( grade ~ medu, data = data )
summary( anova )
               Df Sum Sq Mean Sq F value  Pr(>F)   
  medu          3    126   41.93   4.063 0.00732 **
  Residuals   391   4035   10.32                   
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Therefore, at a significance level of 5%, we have found sufficient evidence to infer that there seems to be a significant positive relationship between the variable medu and the target variable grade.


Variable fedu*

In this section we investigate the relationship between the variable fedu and the target variable grade. Fedu is a categorical variable representing the highest attained education level of the student’s father (scale 1 to 4).

summary(data $ fedu)
    1   2   3   4 
   82 117 100  96


The scales represent the following:

  • 1: finished primary education

  • 2: finished 5th to 9th grade of primary education

  • 3: finished secondary education

  • 4: finished higher education

ggplot( data = data, aes( x = fedu, y = grade )) + 
  geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28') ) +
  labs( x = "fathers' education", y = "final grade" ) + 
  ggtitle("Boxplot variable fedu") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = fedu ), alpha = 0.3 ) + 
  ggtitle("Density plot variable fedu") + 
  scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28')) + 
  theme(plot.title = element_text(hjust = 0.5)) 


It again appears to be the case that students whose fathers attained a higher education level score higher grades than students whose fathers attained a lower education level. To investigate this further, we will conduct an ANOVA test to assess the potential dependency between the variables fedu and grade.

anova = aov( grade ~ fedu, data = data )
summary( anova )
               Df Sum Sq Mean Sq F value Pr(>F)  
  fedu          3    118   39.29   3.799 0.0105 *
  Residuals   391   4043   10.34                 
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null-hypothesis since the p-value < 𝛼 = 0.05. Therefore, at a significance level of 5%, we have found sufficient evidence to infer that there seems to be a significant positive relationship between the variable fedu and the target variable grade.


Students with parents having higher levels of education tend to achieve higher academic performance compared to students with parents having lower levels of education. Hence, H9 will be accepted.


Variable mjob*

In this section we investigate the relationship between the variable mjob and the target variable grade. Mjob is a nominal variable representing the mother’s occupation.

summary(data $ mjob)
   at_home   health    other services  teacher 
        59       34      141      103       58


The type of jobs are the following:

  • Teacher (higher-educated occupation)

  • Health care related (higher-educated occupation)

  • Civil services

  • At home

  • Other

ggplot( data = data, aes( x = mjob, y = grade )) + 
  geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green') ) +
  labs( x = "mothers' occupation", y = "final grade" ) + 
  ggtitle("Boxplot variable mjob") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = mjob ), alpha = 0.3 ) + 
  scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green')) + 
  ggtitle("Density plot variable mjob") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable mjob. It seems to be the case that students whose mothers work in health care score significantly higher than students whose mothers do not. To investigate this further, we will conduct a simple linear regression to assess the potential dependency between the variable mjob and the target variable grade.

mjob.lm <- lm(grade ~ mjob, data = data)
summary(mjob.lm)
  
  Call:
  lm(formula = grade ~ mjob, data = data)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -7.1986 -2.1471 -0.1986  2.0840  7.8136 
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  11.18644    0.41834  26.740  < 2e-16 ***
  mjobhealth    1.96062    0.69189   2.834  0.00484 ** 
  mjobother     0.01214    0.49824   0.024  0.98057    
  mjobservices  0.85239    0.52465   1.625  0.10504    
  mjobteacher   0.60666    0.59417   1.021  0.30788    
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 3.213 on 390 degrees of freedom
  Multiple R-squared:  0.03225, Adjusted R-squared:  0.02232 
  F-statistic: 3.249 on 4 and 390 DF,  p-value: 0.0122


When testing this through a simple linear regression, we see that having a mother who works in health care has a significant effect at a significance level of 5%, since the p-value < 0.05. Both the plots and the linear model suggest that mjob is a useful predictor for the target variable grade.
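To read these coefficients: with a factor predictor, lm() uses treatment (dummy) coding by default, so the intercept is the mean grade of the reference level (here at_home) and each coefficient is that category's mean minus the reference mean; for example, the health-care group mean is roughly 11.19 + 1.96 ≈ 13.15. A self-contained illustration on toy data:

```r
# Treatment coding in lm(): intercept = reference-group mean,
# each coefficient = group mean minus reference mean (toy data)
toy <- data.frame(job   = factor(rep(c("at_home", "health"), each = 3)),
                  grade = c(10, 11, 12, 13, 14, 15))
fit <- lm(grade ~ job, data = toy)
coef(fit)  # (Intercept) = 11 (at_home mean), jobhealth = 3 (14 - 11)
```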


Variable fjob

In this section we investigate the relationship between the variable fjob and the target variable grade. Fjob is a nominal variable representing the father’s occupation.

summary(data $ fjob)
   at_home   health    other services  teacher 
        20       18      217      111       29


The type of jobs are the following:

  • Teacher (higher-educated occupation)

  • Health care related (higher-educated occupation)

  • Civil services

  • At home

  • Other

ggplot( data = data, aes( x = fjob, y = grade )) + 
  geom_boxplot( fill = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green') ) +
  labs( x = "fathers' occupation", y = "final grade" ) + 
  ggtitle("Boxplot variable fjob") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = fjob ), alpha = 0.3 ) + 
  ggtitle("Density plot variable fjob") + 
   scale_fill_manual(values = c("#c8e5c7", "#b6d2e4", "#f8a5a7", '#ffca28', 'green')) + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable fjob. It seems to be the case that students whose fathers work as teachers score higher grades than students whose fathers do not. To investigate this further, we will conduct a simple linear regression to assess the potential dependency between the variable fjob and the target variable grade.


fjob.lm <- lm(grade ~ fjob, data = data)
summary(fjob.lm)
  
  Call:
  lm(formula = grade ~ fjob, data = data)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -7.5069 -2.0474 -0.5069  2.4931  8.4505 
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   11.7500     0.7226  16.262   <2e-16 ***
  fjobhealth    -0.1389     1.0499  -0.132   0.8948    
  fjobother     -0.2431     0.7551  -0.322   0.7477    
  fjobservices  -0.2005     0.7850  -0.255   0.7986    
  fjobteacher    1.5948     0.9392   1.698   0.0903 .  
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 3.231 on 390 degrees of freedom
  Multiple R-squared:  0.02137, Adjusted R-squared:  0.01133 
  F-statistic: 2.129 on 4 and 390 DF,  p-value: 0.07656

When testing this through a simple linear regression, we see that none of the categories within the variable is significant at a significance level of 5%, since all p-values are > 0.05. The plots and the linear regression model suggest that fjob is not a good predictor for the target variable grade.


Students with parents having higher-educated occupations (teacher and health-care) tend to achieve higher academic performance compared to students with parents having lower-educated occupations (civil-service, at home and other). Hence, H10 will be accepted.


5.6 Determinant of grade 5: educational and social support

Variable schoolsup*

In this section we investigate the relationship between the variable schoolsup and the target variable grade.

ggplot( data = data ) + 
  geom_bar( aes( x = schoolsup, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable schoolsup") + 
  theme(plot.title = element_text(hjust = 0.5)) 


Out of the 395 students, 87.1% (344 students) did not get support from school and 12.9% (51 students) did get support from school.

ggplot( data = data ) +
  geom_boxplot( aes( x = schoolsup, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('no support', 'support')) + 
  xlab("school support") + ylab("final grades") +
  ggtitle("Boxplot variable schoolsup") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = schoolsup ), alpha = 0.3 ) + 
  ggtitle("Density plot variable schoolsup") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots indicate graphical evidence of the predictive importance of the variable schoolsup. Students who received support from school appear to score lower than students who did not receive any support. To investigate this further, we conduct a two-sample t-test to assess the potential dependency between the variables schoolsup and grade.

t.test(grade ~ schoolsup, data = data, alternative = 'greater')
  
    Welch Two Sample t-test
  
  data:  grade by schoolsup
  t = 5.9418, df = 76.59, p-value = 3.907e-08
  alternative hypothesis: true difference in means between group no and group yes is greater than 0
  95 percent confidence interval:
   1.688888      Inf
  sample estimates:
   mean in group no mean in group yes 
          11.973837          9.627451


We reject the null hypothesis since the p-value \(< \alpha=0.05\). Thus, at a significance level of 5%, we have found sufficient evidence to infer that students who receive extra educational school support achieve lower grades than those who do not receive such support. Based on the two-sample t-test, there seems to be a significant negative relationship between the variable schoolsup and the target variable grade. Hence, H11 will be rejected.


Variable famsup

In this section we investigate the relationship between the variable famsup and the target variable grade.

ggplot( data = data ) + 
  geom_bar( aes( x = famsup, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable famsup") + 
  theme(plot.title = element_text(hjust = 0.5)) 


Out of the 395 students, 61.3% (242 students) received support from family and 38.7% (153 students) did not receive support from family.

ggplot( data = data ) +
  geom_boxplot( aes( x = famsup, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('no support', 'support')) + 
  xlab("family support") + ylab("final grades") + 
  ggtitle("Boxplot variable famsup") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = famsup ), alpha = 0.3 ) + 
  ggtitle("Density plot variable famsup") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots do not indicate graphical evidence of the predictive importance of the variable famsup. Students who received support from family appear to score about the same as students who did not receive such support. To investigate this further, we conduct a two-sample t-test to assess the potential dependency between the variables famsup and grade.

t.test(grade ~ famsup, data = data, alternative = 'less')
  
    Welch Two Sample t-test
  
  data:  grade by famsup
  t = 1.1028, df = 333.76, p-value = 0.8646
  alternative hypothesis: true difference in means between group no and group yes is less than 0
  95 percent confidence interval:
        -Inf 0.9146439
  sample estimates:
   mean in group no mean in group yes 
           11.89542          11.52893


We do not reject the null hypothesis since the p-value \(> \alpha=0.05\). Thus, at a significance level of 5%, we have found insufficient evidence to infer that students who receive family-related support achieve higher grades than those who do not. Based on the two-sample t-test, there does not seem to be a significant positive relationship between the variable famsup and the target variable grade. Hence, H12 will be rejected.


Variable paid

In this section we investigate the relationship between the variable paid and the target variable grade.

describe(data $ paid)
  data$paid 
         n  missing distinct 
       395        0        2 
                        
  Value         no   yes
  Frequency    214   181
  Proportion 0.542 0.458
ggplot( data = data ) + 
  geom_bar( aes( x = paid, fill = cutgrade2 ), color = "#283a53" ) +
  scale_fill_manual( values = c( "#f8a5a7", "#b6d2e4" ) ) + 
  ggtitle("Histogram variable paid") + 
  theme(plot.title = element_text(hjust = 0.5)) 


Out of the 395 students, 45.8% (181 students) took extra paid classes and 54.2% (214 students) did not take extra paid classes.

ggplot( data = data ) +
  geom_boxplot( aes( x = paid, y = grade ), fill = c("#f8a5a7", "#b6d2e4") ) +
  scale_x_discrete(labels= c('no', 'yes')) + 
  xlab("extra paid classes") + ylab("final grades") + 
  ggtitle("Boxplot variable paid") + 
  theme(plot.title = element_text(hjust = 0.5)) 

ggplot( data = data ) +
  geom_density( aes( x = grade, fill = paid ), alpha = 0.3 ) + 
  ggtitle("Density plot variable paid") + 
  theme(plot.title = element_text(hjust = 0.5)) 


The above plots do not indicate graphical evidence of the predictive importance of the variable paid. The boxplots and density curves appear nearly identical for both groups: students who took extra paid classes seem to score about the same as students who did not. To investigate this further, we conduct a two-sample t-test to assess the potential dependency between the variables paid and grade.

t.test(grade ~ paid, data = data, alternative = 'less')
  
    Welch Two Sample t-test
  
  data:  grade by paid
  t = 0.41906, df = 388.64, p-value = 0.6623
  alternative hypothesis: true difference in means between group no and group yes is less than 0
  95 percent confidence interval:
        -Inf 0.6758202
  sample estimates:
   mean in group no mean in group yes 
           11.73364          11.59669


We do not reject the null hypothesis since the p-value \(> \alpha=0.05\). Thus, at a significance level of 5%, we have found insufficient evidence to infer that students who took extra paid classes achieved higher grades than those who did not. Based on the two-sample t-test, there does not seem to be a significant positive relationship between the variable paid and the target variable grade. Hence, H13 will be rejected.


6 Data Preparation

In this section we prepare our data frame for modeling. We partition the data set randomly into two groups: a train set (80%) and a test set (20%). We use the  partition()  function from the liver package as follows:

set.seed(1)

data_sets = partition( data = data, prob = c( 0.8, 0.2 ) )

train_set = data_sets $ part1
test_set  = data_sets $ part2

actual_test  = test_set $ grade


Note that we use the  set.seed()  function here to establish the reproducibility of our results.
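For intuition, the same kind of reproducible 80/20 split can be sketched with base R alone. This is a hedged illustration, not the implementation of  partition() ; the data frame  df  below is a hypothetical stand-in for the student data:

```r
# Reproducible 80/20 split using base R only. 'df' is a hypothetical
# stand-in with 395 rows, like the real student data frame.
set.seed(1)
df <- data.frame(grade = round(runif(395, min = 0, max = 20)))

idx   <- sample(seq_len(nrow(df)), size = round(0.8 * nrow(df)))
train <- df[idx,  , drop = FALSE]   # 316 rows (80%)
test  <- df[-idx, , drop = FALSE]   #  79 rows (20%)
```

Because the seed is fixed, rerunning the script reproduces exactly the same split.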


To validate the effectiveness of the partitioning process, we compare the mean of our target variable grade between the train set and the test set. To assess whether the population mean grade differs between the two data sets, we conduct a two-sample t-test for the difference in means. The hypotheses are as follows:

\[ \begin{cases} H_0: \mu_1 = \mu_2 \\ H_1: \mu_1 \neq \mu_2 \end{cases} \]



To run the Two-sample T-test, we use the  t.test()  function as follows:

t.test(train_set $ grade, test_set $ grade, var.equal=TRUE)
  
    Two Sample t-test
  
  data:  train_set$grade and test_set$grade
  t = 0.2525, df = 393, p-value = 0.8008
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -0.7266068  0.9407451
  sample estimates:
  mean of x mean of y 
   11.69040  11.58333


We do not reject the null hypothesis since the p-value \(> \alpha=0.05\). Therefore, at a significance level of 5%, we have found insufficient evidence to infer that the mean grade differs between the train set and the test set.


To further validate the partition, we also test whether the variance in the grades obtained by students differs between the two data sets, using an F-test for the ratio of variances. The hypotheses are as follows:

\[ \begin{cases} H_0: \sigma_1^2 = \sigma_2^2 \\ H_1: \sigma_1^2 \neq \sigma_2^2 \end{cases} \]



To run the F-test, we use the  var.test()  function as follows:

var.test(train_set $ grade, test_set $ grade, alternative = "two.sided")
  
    F test to compare two variances
  
  data:  train_set$grade and test_set$grade
  F = 1.2556, num df = 322, denom df = 71, p-value = 0.2476
  alternative hypothesis: true ratio of variances is not equal to 1
  95 percent confidence interval:
   0.8524131 1.7714166
  sample estimates:
  ratio of variances 
            1.255587


We again do not reject the null hypothesis since the p-value \(> \alpha=0.05\). Therefore, at a significance level of 5%, we have found insufficient evidence to infer that the variance in grades differs between the train set and the test set.


The results of the two-sample t-test and the F-test give no indication that the train set and the test set differ in the mean or variance of the target variable grade, so we consider the partition validated.


7 Modeling

In “Exploratory Data Analysis (EDA)” the 16 predicting variables of the data frame were inspected. The EDA indicates that the following predictors might be significant in drawing conclusions about the effect of socioeconomic status on the academic performance of students: age, address, fedu, medu, mjob, studytime, failures, schoolsup, paid, higher and absences.


With the mentioned variables, the hypotheses will be tested with different types of machine-learning algorithms, using the following formula:

formula_significant = grade ~ (age + address + medu + fedu + mjob + studytime + failures + schoolsup + paid + higher + absences)

7.1 Multiple Linear Regression

The multiple linear regression model is created using the  lm()  function. This gives the following equation:

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3x_3 + b_4x_4 + b_5x_5 + b_6x_6 + b_7x_7 + b_8x_8 + b_9x_9 + b_{10}x_{10} + b_{11}x_{11} \]


To explore the linear regression results, run the summary() command with the name of the specified model.

reg_sign = lm( formula_significant, data = train_set )
summary( reg_sign )
  
  Call:
  lm(formula = formula_significant, data = train_set)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -6.7756 -2.1003 -0.0619  1.8515  7.3738 
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  12.31428    2.86590   4.297 2.34e-05 ***
  age          -0.09833    0.15140  -0.649 0.516546    
  addressU      0.81030    0.42413   1.911 0.057013 .  
  medu2         0.51107    0.57726   0.885 0.376673    
  medu3         0.49407    0.64745   0.763 0.446002    
  medu4         0.98941    0.76251   1.298 0.195425    
  fedu2         0.37074    0.50154   0.739 0.460351    
  fedu3         0.11499    0.56842   0.202 0.839813    
  fedu4         0.72620    0.63757   1.139 0.255600    
  mjobhealth    0.73330    0.80956   0.906 0.365760    
  mjobother    -0.50101    0.53895  -0.930 0.353322    
  mjobservices  0.37116    0.59728   0.621 0.534793    
  mjobteacher  -0.85252    0.77903  -1.094 0.274679    
  studytime2   -0.32565    0.41426  -0.786 0.432416    
  studytime3    0.86778    0.54233   1.600 0.110624    
  studytime4    1.66591    0.78228   2.130 0.034017 *  
  failures     -0.59632    0.24676  -2.417 0.016259 *  
  schoolsupyes -2.53991    0.50129  -5.067 7.06e-07 ***
  paidyes      -0.65197    0.35384  -1.843 0.066371 .  
  higheryes     1.00085    0.78596   1.273 0.203852    
  absences     -0.10606    0.02820  -3.761 0.000203 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 2.965 on 302 degrees of freedom
  Multiple R-squared:  0.2497,  Adjusted R-squared:    0.2 
  F-statistic: 5.025 on 20 and 302 DF,  p-value: 9.99e-11


As the results above show, schoolsup and absences are significant at \(\alpha=0.001\), and failures and studytime4 at \(\alpha=0.05\). The remaining predictors (age, address, fedu, medu, mjob, studytime apart from studytime4, paid and higher) are not significant at \(\alpha=0.05\). In the model above, the residual standard error is approximately 2.965, meaning that the model predicts a student's grade with an average error of about 2.965 grade points. The adjusted R-squared is approximately 0.20, indicating that about 20% of the variance in the target variable grade is explained by the predictors included in the model.


7.2 Stepwise linear regression

With stepwise linear regression, predicting variables are added to or removed from the model based on the AIC criterion. This is done to identify the most relevant variables for predicting the target variable. We use the  step()  function with direction = "both", so that predictors can be both added and removed during the search.


The results of the stepwise linear regression can be inspected by calling the  summary()  function:

reg_stepwise = step( reg_sign, direction = "both" )
  Start:  AIC=722.47
  grade ~ (age + address + medu + fedu + mjob + studytime + failures + 
      schoolsup + paid + higher + absences)
  
              Df Sum of Sq    RSS    AIC
  - medu       3    15.639 2671.1 718.36
  - fedu       3    17.421 2672.8 718.58
  - age        1     3.709 2659.1 720.92
  - higher     1    14.258 2669.7 722.20
  <none>                   2655.4 722.47
  - mjob       4    79.455 2734.9 723.99
  - paid       1    29.852 2685.3 724.08
  - address    1    32.094 2687.5 724.35
  - failures   1    51.349 2706.8 726.65
  - studytime  3   103.019 2758.4 728.76
  - absences   1   124.356 2779.8 735.25
  - schoolsup  1   225.729 2881.2 746.82
  
  Step:  AIC=718.36
  grade ~ age + address + fedu + mjob + studytime + failures + 
      schoolsup + paid + higher + absences
  
              Df Sum of Sq    RSS    AIC
  - age        1     3.709 2674.8 716.81
  - fedu       3    38.809 2709.9 717.02
  - higher     1    12.347 2683.4 717.85
  <none>                   2671.1 718.36
  - paid       1    26.866 2697.9 719.60
  - mjob       4    83.576 2754.6 720.31
  - address    1    36.696 2707.8 720.77
  + medu       3    15.639 2655.4 722.47
  - failures   1    57.641 2728.7 723.26
  - studytime  3   112.081 2783.1 725.64
  - absences   1   116.670 2787.7 730.17
  - schoolsup  1   227.021 2898.1 742.71
  
  Step:  AIC=716.81
  grade ~ address + fedu + mjob + studytime + failures + schoolsup + 
      paid + higher + absences
  
              Df Sum of Sq    RSS    AIC
  - fedu       3    42.519 2717.3 715.91
  - higher     1    15.934 2690.7 716.73
  <none>                   2674.8 716.81
  - paid       1    26.853 2701.6 718.04
  + age        1     3.709 2671.1 718.36
  - mjob       4    84.726 2759.5 718.88
  - address    1    40.381 2715.2 719.65
  + medu       3    15.640 2659.1 720.92
  - failures   1    62.605 2737.4 722.28
  - studytime  3   112.296 2787.1 724.10
  - absences   1   135.649 2810.4 730.79
  - schoolsup  1   227.262 2902.0 741.15
  
  Step:  AIC=715.91
  grade ~ address + mjob + studytime + failures + schoolsup + paid + 
      higher + absences
  
              Df Sum of Sq    RSS    AIC
  <none>                   2717.3 715.91
  - higher     1    20.191 2737.5 716.30
  + fedu       3    42.519 2674.8 716.81
  + age        1     7.420 2709.9 717.02
  + medu       3    38.665 2678.6 717.28
  - paid       1    29.038 2746.3 717.34
  - address    1    42.008 2759.3 718.86
  - mjob       4   108.092 2825.4 720.51
  - studytime  3   105.564 2822.9 722.22
  - failures   1    81.586 2798.9 723.46
  - absences   1   137.244 2854.5 729.82
  - schoolsup  1   221.817 2939.1 739.25
summary(reg_stepwise)
  
  Call:
  lm(formula = grade ~ address + mjob + studytime + failures + 
      schoolsup + paid + higher + absences, data = train_set)
  
  Residuals:
       Min       1Q   Median       3Q      Max 
   -6.7806  -2.1537   0.0297   1.9296   7.5318 
  
  Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  11.056469   0.864954  12.783  < 2e-16 ***
  addressU      0.909050   0.415923   2.186  0.02959 *  
  mjobhealth    1.436078   0.709903   2.023  0.04394 *  
  mjobother    -0.342567   0.512123  -0.669  0.50405    
  mjobservices  0.789537   0.541137   1.459  0.14557    
  mjobteacher   0.004827   0.633383   0.008  0.99392    
  studytime2   -0.370854   0.408066  -0.909  0.36416    
  studytime3    0.885870   0.529945   1.672  0.09561 .  
  studytime4    1.544417   0.771251   2.002  0.04611 *  
  failures     -0.725731   0.238263  -3.046  0.00252 ** 
  schoolsupyes -2.409775   0.479809  -5.022 8.65e-07 ***
  paidyes      -0.639157   0.351732  -1.817  0.07016 .  
  higheryes     1.157629   0.763977   1.515  0.13073    
  absences     -0.107179   0.027130  -3.951 9.67e-05 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 2.965 on 309 degrees of freedom
  Multiple R-squared:  0.2322,  Adjusted R-squared:  0.1999 
  F-statistic: 7.188 on 13 and 309 DF,  p-value: 2.921e-12


The result of the stepwise regression is that we keep address, mjob, studytime, failures, schoolsup, paid, higher and absences as predictors of the final grade. Their significance levels are: schoolsup and absences at \(\alpha=0.001\); failures at \(\alpha=0.01\); studytime4, mjobhealth and addressU at \(\alpha=0.05\). The remaining coefficients (mjob apart from mjobhealth, studytime apart from studytime4, paid and higher) are not significant at \(\alpha=0.05\). The variables dropped from the regression are age, medu and fedu.


When comparing both models, the stepwise model has an adjusted R-squared only 0.0001 lower than that of the full regression model (0.1999 versus 0.2000). The full multiple linear regression therefore explains marginally more of the variance, but the stepwise regression explains the dependent variable grade to almost the same extent with fewer variables and is thus the preferred model in this case.


The final model is defined as follows:

reg_final = step( reg_sign, direction = "both", trace = 0 )

Since this is the same stepwise procedure as above, reg_final is identical to reg_stepwise; the step-by-step trace and summary output are therefore not repeated here.


The final model is: grade = 11.056469 + 0.909050 * addressU + 1.436078 * mjobhealth - 0.342567 * mjobother + 0.789537 * mjobservices + 0.004827 * mjobteacher - 0.370854 * studytime2 + 0.885870 * studytime3 + 1.544417 * studytime4 - 0.725731 * failures - 2.409775 * schoolsupyes - 0.639157 * paidyes + 1.157629 * higheryes - 0.107179 * absences.
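To illustrate how the equation is used, the sketch below plugs the fitted coefficients reported above into the model for one hypothetical student (urban address, mother working in health care, studytime level 3, no past failures, no school support, no paid classes, ambition for higher education, 4 absences); the student profile is invented for illustration only:

```r
# Plugging the stepwise coefficients into the final equation for one
# hypothetical student; dummy variables not listed below are 0.
intercept  <- 11.056469
addressU   <-  0.909050   # lives in an urban area
mjobhealth <-  1.436078   # mother works in health care
studytime3 <-  0.885870   # weekly study time at level 3
higheryes  <-  1.157629   # wants to pursue higher education
absences   <- -0.107179   # effect per school absence

pred_grade <- intercept + addressU + mjobhealth + studytime3 +
  higheryes + 4 * absences
round(pred_grade, 2)   # predicted final grade, about 15.02
```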


7.3 Verification of Modelling Assumptions

Before the model is used, certain assumptions have to be verified: linearity, independence, normality and constant variance. These assumptions are checked by inspecting the following plots:

plot( reg_final )  


Upon analyzing the residuals versus fits plot (upper-left), we find no evident patterns such as curvature or systematic nonuniformity in the vertical spread of the points; therefore the linearity and constant-variance assumptions appear to hold. In the Normal Q-Q plot, most points line up along a straight line, so the normality assumption is accepted. For this dataset, the independence assumption means that the grade of one student should not depend on the grades of other students; since each student's grade is recorded independently of their peers', the independence assumption is satisfied.


7.4 Regression with Decision Tree by CART

Here, we want to predict a student's grade based on the earlier listed set of predictors. To do so, we grow a regression tree with the CART algorithm, using the  rpart()  function from the rpart package:

tree_cartG = rpart( formula = formula_significant, data = train_set, method = "anova" )
print( tree_cartG )
  n= 323 
  
  node), split, n, deviance, yval
        * denotes terminal node
  
    1) root 323 3539.04000 11.690400  
      2) schoolsup=yes 46  297.32610  9.717391  
        4) fedu=1,4 22  111.86360  8.772727 *
        5) fedu=2,3 24  147.83330 10.583330 *
      3) schoolsup=no 277 3032.91000 12.018050  
        6) failures>=0.5 57  542.66670 10.333330  
         12) absences>=1 38  320.84210  9.631579 *
         13) absences< 1 19  165.68420 11.736840 *
        7) failures< 0.5 220 2286.54500 12.454550  
         14) mjob=at_home,other,teacher 145 1463.97200 11.813790  
           28) absences>=13.5 9   67.55556  8.777778 *
           29) absences< 13.5 136 1307.97100 12.014710  
             58) fedu=1,3 61  645.31150 11.245900  
              116) age>=16.5 34  324.50000 10.500000  
                232) absences>=0.5 24  158.95830  9.708333 *
                233) absences< 0.5 10  114.40000 12.400000 *
              117) age< 16.5 27  278.07410 12.185190 *
             59) fedu=2,4 75  597.28000 12.640000 *
         15) mjob=health,services 75  647.94670 13.693330  
           30) absences>=7.5 19  138.94740 11.947370 *
           31) absences< 7.5 56  431.42860 14.285710  
             62) medu=1,2 14   35.21429 12.642860 *
             63) medu=3,4 42  345.83330 14.833330  
              126) studytime=1,2 31  265.41940 14.225810  
                252) age>=15.5 20  168.20000 13.300000 *
                253) age< 15.5 11   48.90909 15.909090 *
              127) studytime=3,4 11   36.72727 16.545450 *


This decision tree can be plotted using the rpart.plot() function from the rpart.plot package:

rpart.plot( tree_cartG, type = 4, extra = 100 )


The tree above has 13 decision nodes and 15 leaves. The CART algorithm selected 8 predictors (schoolsup, failures, mjob, age, fedu, absences, medu and studytime) as important for predicting the mean grade. The variable schoolsup is the most important predictor, since it forms the root node of the tree. The tree yields 15 distinct predictions of the mean grade, one per leaf.
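The mechanics of such a regression tree can be sketched on a built-in data set (mtcars, not the student data): every observation that falls into the same leaf receives the same fitted mean of the target variable.

```r
# Minimal, self-contained CART regression tree on the built-in mtcars data.
library(rpart)

fit        <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
preds      <- predict(fit, mtcars)   # fitted mean for each observation
leaf_means <- unique(preds)          # one distinct prediction per leaf
length(leaf_means)                   # number of leaves in the fitted tree
```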


7.5 Random Forest algorithm

The random forest algorithm grows a series of decision trees and combines them into one final prediction. Using all the significant variables, we run the random forest algorithm with the  randomForest()  function from the randomForest package:

random_forest_grade <- randomForest(formula_significant, data=train_set, importance=TRUE, proximity=TRUE)

The importance of the variables, as measured by the random forest algorithm, can be visualized in a dot chart. Ranked by %IncMSE, the predictors are, from most to least important: schoolsup, failures, absences, age, paid, mjob, studytime, higher, medu, fedu and address.

varImpPlot(random_forest_grade)

importance(random_forest_grade)
              %IncMSE IncNodePurity
  age        6.620826     303.96570
  address    1.136158     102.42985
  medu       4.770631     272.40380
  fedu       3.981114     294.47902
  mjob       5.594522     342.42802
  studytime  5.445337     304.44290
  failures  10.742009     211.11192
  schoolsup 22.201919     201.75613
  paid       6.583295     125.52535
  higher     5.083429      33.14082
  absences   9.283945     521.12491


8 Model Evaluation

To evaluate the models used in this research, we inspect two metrics: the mean squared error (MSE) and the (adjusted) R-squared.
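For reference, the MSE over the \(n\) test observations is defined as:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

where \(y_i\) is the observed grade of student \(i\) and \(\hat{y}_i\) the predicted grade. Its square root, the RMSE, is expressed on the same scale as the grades themselves.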


8.1 Stepwise Regression

predict_reg_final = predict(reg_final, test_set)
(mse(predict_reg_final, actual_test))
  [1] 6.975618

The MSE is 6.975618, which corresponds to a root mean squared error (RMSE) of approximately 2.64 grade points. As mentioned earlier, the adjusted R-squared of the linear regression model is 0.1999, which means that 19.99% of the variance in the final grade can be explained by the predictors.


8.2 CART

predict_cart = predict( tree_cartG, test_set )
(mse(predict_cart, actual_test))
  [1] 7.850942
rsq_cart = (cor(predict_cart, actual_test))**2
rsq_cart
  [1] 0.1323372

The MSE is 7.850942, corresponding to an RMSE of approximately 2.80 grade points. The R-squared of the CART model is 0.1323372, indicating that 13.23% of the variance in the final grade can be explained by the predictors.


8.3 Random Forest

print(random_forest_grade)
  
  Call:
   randomForest(formula = formula_significant, data = train_set,      importance = TRUE, proximity = TRUE) 
                 Type of random forest: regression
                       Number of trees: 500
  No. of variables tried at each split: 3
  
            Mean of squared residuals: 9.268307
                      % Var explained: 15.41

The MSE is 9.268307, corresponding to an RMSE of approximately 3.04 grade points. The variance in the final grade explained by the predictors, the R-squared of the random forest, is 15.41%.


In general, the better-performing model has a lower mean squared error and a higher adjusted R-squared. The difference in adjusted R-squared between the best-performing model, the full multiple linear regression, and the stepwise regression is only 0.0001, while the stepwise regression needs fewer variables to predict the dependent variable grade. For that reason, the stepwise regression is the chosen model: grade = 11.056469 + 0.909050 * addressU + 1.436078 * mjobhealth - 0.342567 * mjobother + 0.789537 * mjobservices + 0.004827 * mjobteacher - 0.370854 * studytime2 + 0.885870 * studytime3 + 1.544417 * studytime4 - 0.725731 * failures - 2.409775 * schoolsupyes - 0.639157 * paidyes + 1.157629 * higheryes - 0.107179 * absences.


8.4 Subsection conclusion and hypotheses

Given the model, we can conclude that living in an urban environment, a mother working in health care or civil services, a weekly study time of 5 hours or more, and the ambition to pursue higher education have a positive impact on the grade, while a mother with a job classified as other, studying 2 to 5 hours per week, the number of past class failures, extra educational school support, extra paid classes and the number of school absences have a negative impact on the grade. The coefficient for a mother working as a teacher is essentially zero (0.004827).


The most important predictors in this model are failures, schoolsup and absences. Every past class failure decreases the predicted grade by 0.725731. Receiving extra educational school support lowers the predicted grade by 2.409775, and every school absence lowers it by a further 0.107179.


The hypotheses that have been accepted are the following: H2b (positive relationship between parents with higher-educated occupations and the grade), H2c (negative relationship between receiving extra educational school support and the grade), H2f (positive relationship between living in an urban area and the grade), H2k, partly (positive relationship between weekly study time and the grade), H2l (negative relationship between the number of school absences and the grade) and H2m (negative relationship between the number of past class failures and the grade). These hypotheses are consistent with our regression model.


Our results do not align with all stated hypotheses, indicating a discrepancy between the expected outcomes and the actual findings of our study. The variables medu and fedu (mother’s and father’s education), famsup, paid, internet, sex, age and higher were statistically insignificant at \(\alpha=0.05\); these respectively contradict H2a (positive relationship between parents with higher education and the grade), H2d (positive relationship between receiving family educational support and the grade), H2e (positive relationship between following extra paid classes and the grade), H2g (positive relationship between having access to the internet and the grade), H2h (negative relationship between the male sex and the grade), H2i (negative relationship between age and the grade) and H2j (positive relationship between the ambition to pursue higher education and the grade). We have therefore rejected these hypotheses.


9 Closing thoughts

The research question under consideration focused on the impact of various socioeconomic factors on the academic performance of high school students in Portugal. Recognizing the influence of socioeconomic factors on academic achievement is crucial for devising specific policies to narrow the divide between advantaged and disadvantaged students. By tackling these discrepancies, educational systems can strive for equal opportunities, promoting a fairer and more equitable society for all. Therefore, this research considered the following research question: To what extent do socioeconomic factors, including demographic characteristics, educational context, environmental conditions, parental education and occupation, and educational and social support, impact the academic performance of high school students in Portugal? To answer this question, we first examined the overall academic performance of high school students in Portugal taking the Mathematics course and then categorized the predictors into five socioeconomic determinants: demographic characteristics, educational context, environmental conditions, parental education and occupation, and educational and social support. We subsequently conducted a detailed exploration of each variable in our Exploratory Data Analysis (EDA).


Our findings revealed that a majority of high school students, 76.2%, successfully passed their Mathematics course, while 23.8% did not meet the passing requirements. During our Exploratory Data Analysis (EDA), we observed that among the demographic factors, only the variable age showed significance. However, educational and environmental factors such as studytime, failures, absences, higher and address all played a substantial role in influencing academic performance. Additionally, parental education emerged as a significant factor affecting students’ grades. When it comes to educational and social support, we found that only the variable schoolsup had a significant impact on academic performance.


In summary, our results suggest that the following variables play a significant role in determining students’ grades: age, studytime, failures, absences, higher, address, parental education (measured by medu and fedu), mother’s occupation (measured by mjob), and schoolsup. Consequently, we accepted hypotheses H2b, H2c, H2f, H2k, H2l and H2m, and rejected hypotheses H2a, H2d, H2e, H2g, H2h, H2i and H2j. These findings underscore the complex nature of academic performance, where numerous variables influence the circumstances in which students live and study, ultimately affecting their grades.


Conducting further research can help identify additional factors that might have been overlooked in this research. Factors such as cultural influences, community support, and individual resilience could play significant roles but might not have been adequately explored. By delving deeper into these aspects, researchers can develop a more comprehensive understanding of the complex nature of socioeconomic influences on academic performance.


10 Reference list


Amponsah, K. D., Aboagye, G. K., Narh-Kert, M., Commey-Mintah, P., & Boateng, F. K. (2022). The Impact of Internet Usage on Students’ Success in Selected Senior High Schools in Cape Coast metropolis, Ghana. European Journal of Educational Sciences, 9(2), 1–18. https://doi.org/10.19044/ejes.v9no2a1


Boztas, S. (2023, July 7). Education or “selecting the new elite”? Criticism of primary school testing. DutchNews.nl. https://www.dutchnews.nl/2023/04/education-or-selecting-the-new-elite-criticism-of-primary-school-testing/


Diris, L. B. R. (2022). Ongelijkheid in het Nederlandse onderwijs door de jaren heen. ESB. https://esb.nu/ongelijkheid-in-het-nederlandse-onderwijs-door-de-jaren-heen/


European Commission. (2023, September 14). Education in Portugal. Retrieved October 1, 2023, from https://eurydice.eacea.ec.europa.eu/national-education-systems/portugal/overview


Hayes, S. (2014). Where smart people live: What your chosen home says about your intelligence. Bustle. https://www.bustle.com/articles/30626-where-smart-people-live-what-your-chosen-home-says-about-your-intelligence#


Idris, M., Hussain, S., & Ahmad, N. (2020). Relationship between Parents’ Education and their children’s Academic Achievement. Journal of Arts and Social Sciences, 7(2), 82–92. https://doi.org/10.46662/jass-vol7-iss2-2020(82-92)


Ministerie van Onderwijs, Cultuur en Wetenschap. (2023, January 12). Later selecteren, beter differentiëren. Advies | Onderwijsraad. https://www.onderwijsraad.nl/publicaties/adviezen/2021/04/15/later-selecteren-beter-differentieren


Owuor, C. A., Simatwa, E. M. W., & Ndolo, M. A. (2022). Influence of Parental Occupation on Student Academic Performance in Public Secondary Schools in Kenya: A Case Study of Homa Bay Sub County. International Journal of Current Research, 14(02), 20878-20885.


Sara. (2023, June 19). 6 Common Types Of Students To Enrol In Tutoring | Oxford Learning. Oxford Learning. https://www.oxfordlearning.com/types-of-tutoring-students/


The Annie E. Casey Foundation. (2022, December 16). Parental involvement in your child’s education. https://www.aecf.org/blog/parental-involvement-is-key-to-student-success-research-shows#


Leon, C. (2018). The Relation between Absences and Grades: A Statistical Analysis. Munich Personal RePEc Archive. https://mpra.ub.uni-muenchen.de/84655/


Ullah, R., & Ullah, H. (2019). Boys versus girls’ educational performance: Empirical evidences from global north and global south. African Educational Research Journal, 7(4), 163–167. https://doi.org/10.30918/aerj.74.19.036